feedback message
LILO: Bayesian Optimization with Interactive Natural Language Feedback
Kobalczyk, Katarzyna, Lin, Zhiyuan Jerry, Letham, Benjamin, Zhao, Zhuokai, Balandat, Maximilian, Bakshy, Eytan
For many real-world applications, feedback is essential in translating complex, nuanced, or subjective goals into quantifiable optimization objectives. We propose a language-in-the-loop framework that uses a large language model (LLM) to convert unstructured natural-language feedback into scalar utilities for Bayesian optimization (BO) over a numeric search space. Unlike preferential BO, which accepts only restricted feedback formats and requires customized models for each domain-specific problem, our approach leverages LLMs to turn varied types of textual feedback into consistent utility signals and to incorporate flexible user priors without manual kernel design. At the same time, our method maintains the sample efficiency and principled uncertainty quantification of BO. We show that this hybrid method not only provides a more natural interface to the decision maker but also outperforms conventional BO baselines and LLM-only optimizers, particularly in feedback-limited regimes.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Germany > North Rhine-Westphalia > Cologne Region > Bonn (0.04)
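Below is a minimal sketch of the language-in-the-loop idea described in the LILO abstract above: textual feedback is mapped to scalar utilities and a standard GP surrogate with expected improvement picks the next candidate. The `llm_utility` function, the scikit-learn surrogate, and the acquisition choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal language-in-the-loop BO sketch (assumptions: llm_utility is a
# placeholder for the LLM call; GP + expected improvement stand in for the
# paper's surrogate and acquisition choices).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def llm_utility(feedback_text: str) -> float:
    """Hypothetical: ask an LLM to map free-form feedback to a scalar utility."""
    raise NotImplementedError("replace with an actual LLM call")


def suggest_next(X_obs, feedback_texts, candidates):
    # 1. Convert unstructured textual feedback into scalar utilities.
    y = np.array([llm_utility(t) for t in feedback_texts])
    # 2. Fit a standard GP surrogate over the numeric search space.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y)
    # 3. Return the candidate maximizing expected improvement.
    mu, sigma = gp.predict(candidates, return_std=True)
    improvement = mu - y.max()
    z = improvement / np.maximum(sigma, 1e-9)
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    return candidates[np.argmax(ei)]
```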
ProToM: Promoting Prosocial Behaviour via Theory of Mind-Informed Feedback
Bortoletto, Matteo, Zhou, Yichao, Ying, Lance, Shu, Tianmin, Bulling, Andreas
While humans are inherently social creatures, the challenge of identifying when and how to assist and collaborate with others - particularly when pursuing independent goals - can hinder cooperation. To address this challenge, we aim to develop an AI system that provides useful feedback to promote prosocial behaviour - actions that benefit others, even when not directly aligned with one's own goals. We introduce ProToM, a Theory of Mind-informed facilitator that promotes prosocial actions in multi-agent systems by providing targeted, context-sensitive feedback to individual agents. ProToM first infers agents' goals using Bayesian inverse planning, then selects feedback to communicate by maximising expected utility, conditioned on the inferred goal distribution. We evaluate our approach against baselines in two multi-agent environments: Doors, Keys, and Gems, as well as Overcooked. Our results suggest that state-of-the-art large language and reasoning models fall short of communicating feedback that is both contextually grounded and well-timed, leading to higher communication overhead without corresponding gains in task speedup. In contrast, ProToM provides targeted and helpful feedback, achieving a higher success rate and shorter task completion times, and is consistently preferred by human users.
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)
- North America > United States > Massachusetts (0.04)
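A toy sketch of the two ProToM stages described above, under simplifying assumptions: goal inference reduces to a Bayes update over a discrete goal set, and feedback selection maximises expected utility under that posterior. The likelihood values and the utility table are illustrative, not the paper's models.

```python
# Toy ProToM-style sketch (assumptions: discrete goal set, hand-specified
# action log-likelihoods and feedback utilities for illustration).
import numpy as np


def goal_posterior(prior, action_loglik):
    """prior[g] = P(goal g); action_loglik[g] = log P(observed actions | goal g)."""
    log_post = np.log(prior) + action_loglik
    log_post -= log_post.max()          # numerical stability
    post = np.exp(log_post)
    return post / post.sum()


def select_feedback(posterior, utility):
    """utility[f, g] = benefit of sending feedback f when the true goal is g."""
    expected = utility @ posterior      # E_g[U(f, g)] for each feedback f
    return int(np.argmax(expected))


# Example: two candidate goals, three candidate feedback messages.
post = goal_posterior(prior=np.array([0.5, 0.5]),
                      action_loglik=np.array([-1.0, -3.0]))
best_message = select_feedback(post, utility=np.array([[1.0, 0.1],
                                                       [0.2, 0.9],
                                                       [0.4, 0.4]]))
```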
Improved Generalized Planning with LLMs through Strategy Refinement and Reflection
Stein, Katharina, Hodel, Nils, Fišer, Daniel, Hoffmann, Jörg, Katz, Michael, Koller, Alexander
LLMs have recently been used to generate Python programs representing generalized plans in PDDL planning, i.e., plans that generalize across the tasks of a given PDDL domain. Previous work proposed a framework consisting of three steps: the LLM first generates a summary and then a strategy for the domain, both in natural language, and then implements that strategy as a Python program, which is debugged on example planning tasks. In that work, only one strategy is generated and passed directly to program generation; if the strategy is incorrect, its implementation will therefore result in an incorrect generalized plan. Here, we introduce an approach that generates the strategy in the form of pseudocode and enables automatic debugging of the pseudocode, allowing us to identify and fix errors before the generalized plan itself is generated. Additionally, we extend the Python debugging phase with a reflection step prompting the LLM to pinpoint the reason for the observed plan failure. Finally, we take inspiration from LLM code generation to produce several program variants and pick the best one. Running experiments on 17 benchmark domains, we show that these extensions substantially improve (and never deteriorate) the quality of the generalized plans. In 12 of the domains, our best Python programs solve all tasks that can be generated with the respective instance generator.
- Europe > Germany > Saarland (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- Europe > Denmark > North Jutland > Aalborg (0.04)
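The pipeline described above (pseudocode strategy, pseudocode debugging, program variants with a reflection step, selection of the best variant) could be skeletonised roughly as follows. The `llm` and `solved` functions and all prompt strings are hypothetical placeholders, not the authors' interface.

```python
# Skeleton of the refine-and-reflect pipeline (assumptions: `llm` and `solved`
# are placeholders; prompts are illustrative, not the authors' wording).
def llm(prompt: str) -> str:
    raise NotImplementedError("replace with an actual LLM call")


def solved(program: str, task) -> bool:
    raise NotImplementedError("execute the generalized plan on a PDDL task")


def generate_generalized_plan(domain_description: str, example_tasks, n_variants: int = 4) -> str:
    # Strategy as pseudocode, then automatic pseudocode debugging.
    strategy = llm(f"Write pseudocode for a strategy solving this domain:\n{domain_description}")
    strategy = llm(f"Check this pseudocode against the example tasks and fix any errors:\n{strategy}")
    variants = []
    for _ in range(n_variants):
        program = llm(f"Implement this pseudocode as a Python program:\n{strategy}")
        for task in example_tasks:
            if not solved(program, task):
                # Reflection step: first ask why the plan failed, then repair.
                reason = llm(f"Explain why this program failed on the task:\n{program}")
                program = llm(f"Fix the program given this failure analysis:\n{reason}\n{program}")
        variants.append(program)
    # Keep the variant that solves the most example tasks.
    return max(variants, key=lambda p: sum(solved(p, t) for t in example_tasks))
```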
Reasoning and Sampling-Augmented MCQ Difficulty Prediction via LLMs
Feng, Wanyong, Tran, Peter, Sireci, Stephen, Lan, Andrew
The difficulty of multiple-choice questions (MCQs) is a crucial factor for educational assessments. Predicting MCQ difficulty is challenging since it requires understanding both the complexity of reaching the correct option and the plausibility of distractors, i.e., incorrect options. In this paper, we propose a novel, two-stage method to predict the difficulty of MCQs. First, to better estimate the complexity of each MCQ, we use large language models (LLMs) to augment the reasoning steps required to reach each option. We use not just the MCQ itself but also these reasoning steps as input to predict the difficulty. Second, to capture the plausibility of distractors, we sample knowledge levels from a distribution to account for variation among students responding to the MCQ. This setup, inspired by item response theory (IRT), enables us to estimate the likelihood of students selecting each option, both correct and incorrect. We align these predictions with their ground truth values using a Kullback-Leibler (KL) divergence-based regularization objective, and use the estimated likelihoods to predict MCQ difficulty. We evaluate our method on two real-world math MCQ and response datasets with ground truth difficulty values estimated using IRT. Experimental results show that our method outperforms all baselines, with up to a 28.3% reduction in mean squared error and a 34.6% improvement in the coefficient of determination. We also qualitatively discuss how our method achieves higher accuracy in predicting MCQ difficulty.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.15)
- Europe > United Kingdom (0.14)
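A rough numerical sketch of the IRT-inspired second stage described above: student knowledge levels are sampled from a distribution, per-option scores are turned into selection probabilities, a KL term aligns them with observed selection rates, and difficulty is read off the correct option's likelihood. The scoring model and distribution here are assumptions for illustration only, not the paper's exact formulation.

```python
# IRT-inspired option-likelihood sketch (assumptions: standard-normal knowledge
# levels and a simple logit model; not the paper's exact formulation).
import numpy as np

rng = np.random.default_rng(0)


def option_likelihoods(option_scores, n_samples=1000):
    """option_scores[k]: model-predicted attractiveness of option k."""
    thetas = rng.normal(size=n_samples)                  # sampled knowledge levels
    logits = np.outer(thetas, option_scores)             # stronger students weight scores more
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs.mean(axis=0)                            # P(student selects each option)


def kl_regularizer(pred, observed, eps=1e-9):
    """KL(observed || predicted), used to align predictions with ground truth rates."""
    return float(np.sum(observed * np.log((observed + eps) / (pred + eps))))


pred = option_likelihoods(np.array([2.0, 0.5, 0.3, 0.2]))   # index 0 = correct option
difficulty = 1.0 - pred[0]                                   # harder when correct option is less likely
loss = kl_regularizer(pred, observed=np.array([0.55, 0.20, 0.15, 0.10]))
```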
Automated Feedback in Math Education: A Comparative Analysis of LLMs for Open-Ended Responses
Baral, Sami, Worden, Eamon, Lim, Wen-Chiang, Luo, Zhuang, Santorelli, Christopher, Gurung, Ashish, Heffernan, Neil
The effectiveness of feedback in enhancing learning outcomes is well documented within Educational Data Mining (EDM). Prior research has explored various methodologies to enhance the effectiveness of feedback, and recent developments in Large Language Models (LLMs) have extended their utility in automated feedback systems. This study explores the potential of LLMs to facilitate automated feedback in math education. We examine the effectiveness of LLMs in evaluating student responses by comparing three models: Llama, SBERT-Canberra, and GPT-4. The evaluation requires each model to provide both a quantitative score and qualitative feedback on students' responses to open-ended math problems. We employ Mistral, a version of Llama tailored to math, and fine-tune this model for evaluating student responses by leveraging a dataset of student responses and teacher-written feedback for middle-school math problems. A similar approach was taken for training the SBERT model, while the GPT-4 model used a zero-shot learning approach. We evaluate each model's scoring accuracy and feedback quality using judgments from two teachers, who applied a shared rubric to assess the accuracy and relevance of the generated feedback. We conduct both quantitative and qualitative analyses of model performance. By offering a detailed comparison of these methods, this study aims to further the ongoing development of automated feedback systems and outlines potential future directions for leveraging generative LLMs to create more personalized learning experiences.
- Oceania > Australia > Australian Capital Territory > Canberra (0.26)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > New York (0.04)
- Education > Educational Setting (1.00)
- Education > Assessment & Standards (1.00)
- Education > Curriculum > Subject-Specific Education (0.94)
- Education > Educational Technology > Educational Software > Computer Based Training (0.88)
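As a rough illustration of the fine-tuning setup described above, the paired student responses and teacher-written feedback might be serialized into prompt/completion records like the following. The field names, prompt wording, score scale, and JSONL format are assumptions, not the study's actual data pipeline.

```python
# Illustrative data preparation for fine-tuning (assumptions: field names,
# prompt wording, and JSONL format are not from the study).
import json


def build_finetune_records(rows, path="feedback_sft.jsonl"):
    """rows: dicts with 'problem', 'student_answer', 'score', 'teacher_feedback'."""
    with open(path, "w", encoding="utf-8") as f:
        for r in rows:
            record = {
                "prompt": (f"Problem: {r['problem']}\n"
                           f"Student answer: {r['student_answer']}\n"
                           "Provide a score (0-4) and written feedback:"),
                "completion": f"Score: {r['score']}\nFeedback: {r['teacher_feedback']}",
            }
            f.write(json.dumps(record) + "\n")
```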
Can Large Language Models Replicate ITS Feedback on Open-Ended Math Questions?
McNichols, Hunter, Lee, Jaewook, Fancsali, Stephen, Ritter, Steve, Lan, Andrew
Intelligent Tutoring Systems (ITSs) often contain an automated feedback component, which provides a predefined feedback message to students when it detects a predefined error. Such feedback components typically rely on template-based approaches, which require significant effort from human experts to anticipate a limited number of possible student errors and write corresponding feedback. This limitation is exemplified in open-ended math questions, where students can make a large number of different errors. In our work, we examine the capability of large language models (LLMs) to generate feedback for open-ended math questions, similar to that of an established ITS that uses a template-based approach. We fine-tune both open-source and proprietary LLMs on real student responses and corresponding ITS-provided feedback. We measure the quality of the generated feedback using text similarity metrics. We find that both open-source and proprietary models show promise in replicating the feedback they see during training, but do not generalize well to previously unseen student errors. These results suggest that despite being able to learn the formatting of feedback, LLMs are not able to fully understand the mathematical errors made by students.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Switzerland (0.04)
- Education > Educational Setting (0.68)
- Education > Educational Technology > Educational Software > Computer Based Training (0.55)
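The evaluation step described above, comparing LLM-generated feedback against the ITS's template feedback with text similarity metrics, could look roughly like this sketch using ROUGE; the paper's exact metric choices are not assumed.

```python
# Text-similarity evaluation sketch (assumption: ROUGE via the rouge-score
# package stands in for the paper's unspecified similarity metrics).
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)


def similarity_to_its(generated: str, its_feedback: str) -> float:
    """ROUGE-L F1 between LLM-generated feedback and the ITS template message."""
    return scorer.score(its_feedback, generated)["rougeL"].fmeasure


score = similarity_to_its(
    "Check the sign when you move the term to the other side.",
    "Remember to flip the sign when moving a term across the equals sign.",
)
```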
Improving the Validity of Automatically Generated Feedback via Reinforcement Learning
Scarlatos, Alexander, Smith, Digory, Woodhead, Simon, Lan, Andrew
Automatically generating feedback via large language models (LLMs) in intelligent tutoring systems and online learning platforms has the potential to improve the learning outcomes of many students. However, both feedback generation and evaluation are challenging: feedback content has to be valid, especially in subjects like math, which requires models to understand the problem, the solution, and where the student's error lies. Feedback also has to be pedagogically valid, reflecting effective tutoring strategies such as explaining possible misconceptions and encouraging the student, among other desirable features. In this work, we address the problems of automatically generating and evaluating feedback while considering both correctness and alignment. First, we propose a rubric for evaluating math feedback and show that GPT-4 is able to effectively use it to annotate human-written and LLM-generated feedback. Second, we propose a framework for feedback generation that optimizes both correctness and alignment using reinforcement learning (RL). Specifically, we use GPT-4's annotations to create preferences over feedback pairs in an augmented dataset for training via direct preference optimization (DPO). We show that our methods significantly increase the correctness and alignment of feedback generated with Llama 2, an open-source LLM, qualitatively analyze our generation and evaluation systems using case studies, and outline several areas for future work.
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- Europe > Switzerland (0.04)
- Education > Educational Technology > Educational Software > Computer Based Training (0.68)
- Education > Curriculum > Subject-Specific Education (0.68)
- Education > Educational Setting > Online (0.48)
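For reference, the DPO training signal mentioned above can be written compactly from the policy and reference-model log-probabilities of the preferred (higher-rated) and rejected feedback in each GPT-4-annotated pair. This is the standard DPO loss, not the authors' exact training code.

```python
# Standard DPO loss from per-example sequence log-probabilities
# (illustrative; not the authors' training code).
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Inputs: log P(feedback | prompt) under the policy and frozen reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push probability mass toward the feedback GPT-4's rubric annotations preferred.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```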
Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning
McNichols, Hunter, Feng, Wanyong, Lee, Jaewook, Scarlatos, Alexander, Smith, Digory, Woodhead, Simon, Lan, Andrew
Multiple-choice questions (MCQs) are ubiquitous at almost all levels of education since they are easy to administer and grade, and are a reliable form of assessment. An important aspect of MCQs is the distractors, i.e., incorrect options that are designed to target specific misconceptions or insufficient knowledge among students. To date, the task of crafting high-quality distractors has largely remained a labor-intensive process for teachers and learning content designers, which limits scalability. In this work, we explore the task of automated distractor and corresponding feedback message generation in math MCQs using large language models. We establish a formulation of these two tasks and propose a simple, in-context learning-based solution. Moreover, we propose generative AI-based metrics for evaluating the quality of the feedback messages. We conduct extensive experiments on these tasks using a real-world MCQ dataset. Our findings suggest that there is significant room for improvement in automated distractor and feedback generation; based on these findings, we outline several directions for future work.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
- Education > Educational Technology > Educational Software (0.68)
- Education > Curriculum > Subject-Specific Education (0.68)
- Education > Educational Setting (0.68)
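A simple sketch of the in-context learning setup described above: a few worked (question, distractor, feedback) examples are placed in the prompt before the new question. The prompt wording and example fields are assumptions, not the authors' templates.

```python
# Few-shot prompt construction sketch (prompt wording and field names assumed).
def build_prompt(examples, new_question, n_shots=3):
    parts = ["Generate a plausible distractor for the math question and a "
             "feedback message explaining the misconception it targets.\n"]
    for ex in examples[:n_shots]:
        parts.append(f"Question: {ex['question']}\n"
                     f"Distractor: {ex['distractor']}\n"
                     f"Feedback: {ex['feedback']}\n")
    parts.append(f"Question: {new_question}\nDistractor:")
    return "\n".join(parts)
```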
Predicting Desirable Revisions of Evidence and Reasoning in Argumentative Writing
We develop models to classify desirable evidence and desirable reasoning revisions in student argumentative writing. We explore two ways to improve classifier performance: using the essay context of the revision, and using the feedback students received before the revision. We perform both intrinsic and extrinsic evaluation for each of our models and report a qualitative analysis. Our results show that while a model using feedback information improves over a baseline model, models utilizing context - either alone or with feedback - are the most successful in identifying desirable revisions.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
- Africa > Kenya (0.04)
- Health & Medicine (0.68)
- Education > Educational Setting > K-12 Education (0.50)
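One plausible reading of the two classifier variants described above is that the revision text is optionally concatenated with its essay context and the prior feedback before being fed to a text classifier; the sketch below illustrates that input construction with an off-the-shelf pipeline. The separator token, features, and classifier are illustrative assumptions, not the paper's models.

```python
# Input-construction sketch for the revision classifier (separator, features,
# and classifier are illustrative assumptions, not the paper's models).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_inputs(rows, use_context=True, use_feedback=True):
    """rows: dicts with 'revision', 'context', 'feedback', 'label'."""
    texts, labels = [], []
    for r in rows:
        parts = [r["revision"]]
        if use_context:
            parts.append(r["context"])
        if use_feedback:
            parts.append(r["feedback"])
        texts.append(" [SEP] ".join(parts))
        labels.append(r["label"])
    return texts, labels


clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
# Usage: texts, labels = build_inputs(train_rows); clf.fit(texts, labels)
```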
Can artificial intelligence encourage good behaviour among Internet users?
Hostile and hateful remarks are thick on the ground on social networks in spite of persistent efforts by Facebook, Twitter, Reddit and YouTube to tone them down. Now researchers at the OpenWeb platform have turned to artificial intelligence to moderate Internet users' comments before they are even posted. The method appears to be effective because one third of users modified the text of their comments when they received a nudge from the new system, which warned that what they had written might be perceived as offensive. The study conducted by OpenWeb and Perspective API analysed 400,000 comments that some 50,000 users were preparing to post on sites like AOL, Salon, Newsweek, RT and Sky Sports. Some of these users received a feedback message or nudge from a machine learning algorithm to the effect that the text they were preparing to post might be insulting, or against the rules for the forum they were using.